I am a PhD student in Economic History and want to learn R to be able to use it in my research. The course seems somewhat challenging for me, but I am quite excited to learn these new skills!
My Github repository is https://github.com/hshirvon/IODS-project
My course diary is at: https://hshirvon.github.io/IODS-project/
date()
## [1] "Sun Dec 5 14:16:45 2021"
This week I have been learning simple linear regression.
date()
## [1] "Sun Dec 5 14:16:45 2021"
1. First, I will read the students2014 data into R from my local folder and explore the structure and dimensions of the data, trying out a few other functions as well.
heididata <- read.table("data/heidindata.csv", sep=",", header=TRUE)
#Explore the structure and the dimensions of the data
str(heididata)
## 'data.frame': 166 obs. of 8 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ sukupuoli : chr "F" "M" "F" "M" ...
## $ lrn14.Age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ lrn14.Attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ lrn14.deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ lrn14.stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ lrn14.surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ lrn14.Points : int 25 12 24 10 22 21 21 31 24 26 ...
dim(heididata)
## [1] 166 8
head(heididata)
## X sukupuoli lrn14.Age lrn14.Attitude lrn14.deep lrn14.stra lrn14.surf
## 1 1 F 53 3.7 3.583333 3.375 2.583333
## 2 2 M 55 3.1 2.916667 2.750 3.166667
## 3 3 F 49 2.5 3.500000 3.625 2.250000
## 4 4 M 53 3.5 3.500000 3.125 2.250000
## 5 5 M 49 3.7 3.666667 3.625 2.833333
## 6 6 F 38 3.8 4.750000 3.625 2.416667
## lrn14.Points
## 1 25
## 2 12
## 3 24
## 4 10
## 5 22
## 6 21
The dataset “heididata” includes 8 variables and 166 observations. It is a subset of an international survey on Approaches to Learning, collected between 3.12.2014 and 10.1.2015. The variables included in “heididata” are gender, age, attitude, deep learning, strategic learning, surface learning, points and a running index X.
2. Next, I will show a graphical overview of the data and show summaries of the variables in the data. I will also describe and interpret the outputs.
plot(heididata)
summary(heididata)
## X sukupuoli lrn14.Age lrn14.Attitude
## Min. : 1.00 Length:166 Min. :17.00 Min. :1.400
## 1st Qu.: 44.25 Class :character 1st Qu.:21.00 1st Qu.:2.600
## Median : 87.50 Mode :character Median :22.00 Median :3.200
## Mean : 90.13 Mean :25.51 Mean :3.143
## 3rd Qu.:136.75 3rd Qu.:27.00 3rd Qu.:3.700
## Max. :183.00 Max. :55.00 Max. :5.000
## lrn14.deep lrn14.stra lrn14.surf lrn14.Points
## Min. :1.583 Min. :1.250 Min. :1.583 Min. : 7.00
## 1st Qu.:3.333 1st Qu.:2.625 1st Qu.:2.417 1st Qu.:19.00
## Median :3.667 Median :3.188 Median :2.833 Median :23.00
## Mean :3.680 Mean :3.121 Mean :2.787 Mean :22.72
## 3rd Qu.:4.083 3rd Qu.:3.625 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :4.917 Max. :5.000 Max. :4.333 Max. :33.00
The age of the respondents ranges from 17 to 55; attitude approximately 1.4 - 5.0; deep learning approximately 1.6 - 4.9; strategic learning approximately 1.3 - 5.0; surface learning approximately 1.6 - 4.3; and points 7 - 33.
3. My regression model explains the target (dependent) variable, exam points, with gender, age and attitude.
my_regression <- lm(lrn14.Points ~ sukupuoli + lrn14.Age + lrn14.Attitude , data = heididata)
#Print out a summary of my regression model
summary(my_regression)
##
## Call:
## lm(formula = lrn14.Points ~ sukupuoli + lrn14.Age + lrn14.Attitude,
## data = heididata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.4590 -3.3221 0.2186 4.0247 10.4632
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.42910 2.29043 5.863 2.48e-08 ***
## sukupuoliM -0.33054 0.91934 -0.360 0.720
## lrn14.Age -0.07586 0.05367 -1.414 0.159
## lrn14.Attitude 3.60657 0.59322 6.080 8.34e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.315 on 162 degrees of freedom
## Multiple R-squared: 0.2018, Adjusted R-squared: 0.187
## F-statistic: 13.65 on 3 and 162 DF, p-value: 5.536e-08
The regression model explains the target (dependent) variable, exam points, with gender, age and attitude. Attitude has a statistically significant relationship with points, but gender and age do not.
Next, I will run the regression again with just attitude as the explanatory variable and print out a summary of the regression model:
my_regression2 <- lm(lrn14.Points ~ lrn14.Attitude , data = heididata)
summary(my_regression2)
##
## Call:
## lm(formula = lrn14.Points ~ lrn14.Attitude, data = heididata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## lrn14.Attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
4. Since gender and age did not have a statistically significant effect, I will analyze the summary of my_regression2, with attitude as the only explanatory variable.
The coefficient is 3.5255, meaning that when the explanatory variable (attitude) increases by one unit, the target variable increases by 3.5255 units. One unit better attitude should result in 3.5255 more points!
The multiple R-squared of the model is 0.1906. This means that approximately 19.1% of the variation in points is explained by the variation in attitude.
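To make the slope interpretation concrete, here is a minimal sketch using the coefficients printed in the summary above (intercept 11.6372, slope 3.5255); the helper function name is my own:

```r
# Fitted line from my_regression2: points = 11.6372 + 3.5255 * attitude
b0 <- 11.6372
b1 <- 3.5255
predicted_points <- function(attitude) b0 + b1 * attitude
predicted_points(3)                        # ≈ 22.21 predicted points for attitude 3
predicted_points(4) - predicted_points(3)  # exactly the slope, 3.5255
```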
5. In a linear regression model we need to assume linearity, and also that the errors are normally distributed, uncorrelated and have constant variance, and that the size of a given error does not depend on the explanatory variables. A boxplot or probability plot of the residuals can be useful for checking symmetry, and specifically the normality of the error terms in the regression model.
Using the linear model object “my_regression” with several explanatory variables, which was created earlier, I will produce the following diagnostic plots: Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage.
#Let's place the following 4 graphics to the same plot:
par(mfrow = c(2,2))
#Calling the plot function with 1 (Residuals vs Fitted values), 2 (Normal QQ-plot) & 5 (Residuals vs Leverage).
plot(my_regression,which=c(1,2,5))
First, let’s look at the residuals against the fitted values of the response variable. If the variability of the residuals appears to increase with the size of the fitted values, a transformation of the response variable prior to fitting is indicated. In my case, the residuals vs. fitted values plot looks pretty much as expected, confirming that the fitted model is appropriate. The assumption of constant variance seems justified.
The QQ-plot of the residuals provides a method to explore the assumption that the errors are normally distributed. There is a small dip at the beginning and at the end, which might cause some worry about the assumption that the errors are normally distributed. Still, the plot shows a reasonable enough fit of the errors with the straight line.
The residuals vs. leverage plot can help identify which observations have an unusually high impact. In my case it does not look very balanced: some observations have an unusually large impact.
#Heidi Hirvonen 18.11.2020
1. First, I created the Rmd file, Chapter3 and added it as a child file in the index.Rmd.
2. Next, I will read the joined student alcohol consumption data into R from https://github.com/rsund/IODS-project/raw/master/data/alc.csv
joined_data <- read.csv("https://github.com/rsund/IODS-project/raw/master/data/alc.csv", sep=",", header=TRUE)
colnames(joined_data)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "guardian" "traveltime" "studytime" "schoolsup"
## [16] "famsup" "activities" "nursery" "higher" "internet"
## [21] "romantic" "famrel" "freetime" "goout" "Dalc"
## [26] "Walc" "health" "n" "id.p" "id.m"
## [31] "failures" "paid" "absences" "G1" "G2"
## [36] "G3" "failures.p" "paid.p" "absences.p" "G1.p"
## [41] "G2.p" "G3.p" "failures.m" "paid.m" "absences.m"
## [46] "G1.m" "G2.m" "G3.m" "alc_use" "high_use"
## [51] "cid"
3. Next I will analyze the relationships between high/low alcohol consumption and some of the other variables in the data.
My assumption is that high alcohol use could negatively affect current health status and also be connected to a higher number of school absences.
I also think that family educational support and extra-curricular activities would make high alcohol use less likely.
4. First, I will access the tidyverse libraries tidyr, dplyr and ggplot2. Then I will explore the distributions of my chosen variables and alcohol use both numerically and graphically (cross-tabulations, barplots and boxplots).
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
glimpse(joined_data)
## Rows: 370
## Columns: 51
## $ school <chr> "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP",…
## $ sex <chr> "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F",…
## $ age <int> 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15,…
## $ address <chr> "R", "R", "R", "R", "R", "R", "R", "R", "R", "U", "U", "U",…
## $ famsize <chr> "GT3", "GT3", "GT3", "GT3", "GT3", "GT3", "GT3", "LE3", "LE…
## $ Pstatus <chr> "T", "T", "T", "T", "T", "T", "T", "T", "T", "A", "A", "T",…
## $ Medu <int> 1, 1, 2, 2, 3, 3, 3, 2, 3, 3, 4, 1, 1, 1, 1, 1, 2, 2, 2, 3,…
## $ Fedu <int> 1, 1, 2, 4, 3, 4, 4, 2, 1, 3, 3, 1, 1, 1, 2, 2, 1, 2, 3, 2,…
## $ Mjob <chr> "at_home", "other", "at_home", "services", "services", "ser…
## $ Fjob <chr> "other", "other", "other", "health", "services", "health", …
## $ reason <chr> "home", "reputation", "reputation", "course", "reputation",…
## $ guardian <chr> "mother", "mother", "mother", "mother", "other", "mother", …
## $ traveltime <int> 2, 1, 1, 1, 2, 1, 2, 2, 2, 1, 1, 3, 1, 1, 1, 1, 3, 1, 2, 1,…
## $ studytime <int> 4, 2, 1, 3, 3, 3, 3, 2, 4, 4, 2, 1, 2, 2, 2, 2, 3, 4, 1, 2,…
## $ schoolsup <chr> "yes", "yes", "yes", "yes", "no", "yes", "no", "yes", "no",…
## $ famsup <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "ye…
## $ activities <chr> "yes", "no", "yes", "yes", "yes", "yes", "no", "no", "no", …
## $ nursery <chr> "yes", "no", "yes", "yes", "yes", "yes", "yes", "yes", "no"…
## $ higher <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "ye…
## $ internet <chr> "yes", "yes", "no", "yes", "yes", "yes", "yes", "yes", "yes…
## $ romantic <chr> "no", "yes", "no", "no", "yes", "no", "yes", "no", "no", "n…
## $ famrel <int> 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 3, 5, 5, 3, 3,…
## $ freetime <int> 1, 3, 3, 3, 2, 3, 2, 1, 4, 3, 3, 3, 3, 4, 3, 2, 2, 1, 5, 3,…
## $ goout <int> 2, 4, 1, 2, 1, 2, 2, 3, 2, 3, 2, 3, 2, 2, 2, 3, 2, 2, 1, 2,…
## $ Dalc <int> 1, 2, 1, 1, 2, 1, 2, 1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1,…
## $ Walc <int> 1, 4, 1, 1, 3, 1, 2, 3, 3, 1, 1, 2, 3, 2, 1, 2, 1, 1, 1, 1,…
## $ health <int> 1, 5, 2, 5, 3, 5, 5, 4, 3, 4, 1, 4, 4, 5, 5, 1, 4, 3, 5, 3,…
## $ n <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
## $ id.p <int> 1096, 1073, 1040, 1025, 1166, 1039, 1131, 1069, 1070, 1106,…
## $ id.m <int> 2096, 2073, 2040, 2025, 2153, 2039, 2131, 2069, 2070, 2106,…
## $ failures <int> 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2,…
## $ paid <chr> "yes", "no", "no", "no", "yes", "no", "no", "no", "no", "no…
## $ absences <int> 3, 2, 8, 2, 5, 2, 0, 1, 9, 10, 0, 3, 2, 0, 4, 1, 2, 6, 2, 1…
## $ G1 <int> 10, 10, 14, 10, 12, 12, 11, 10, 16, 10, 14, 10, 11, 10, 12,…
## $ G2 <int> 12, 8, 13, 10, 12, 12, 6, 10, 16, 10, 14, 6, 11, 12, 12, 14…
## $ G3 <int> 12, 8, 12, 9, 12, 12, 6, 10, 16, 10, 15, 6, 11, 12, 12, 14,…
## $ failures.p <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,…
## $ paid.p <chr> "yes", "no", "no", "no", "yes", "no", "no", "no", "no", "no…
## $ absences.p <int> 4, 2, 8, 2, 2, 2, 0, 0, 6, 10, 0, 6, 2, 0, 6, 0, 0, 4, 4, 2…
## $ G1.p <int> 13, 13, 14, 10, 13, 11, 10, 11, 15, 10, 15, 11, 13, 12, 13,…
## $ G2.p <int> 13, 11, 13, 11, 13, 12, 11, 10, 15, 10, 14, 12, 12, 12, 12,…
## $ G3.p <int> 13, 11, 12, 10, 13, 12, 12, 11, 15, 10, 15, 13, 12, 12, 13,…
## $ failures.m <int> 1, 2, 0, 0, 2, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3,…
## $ paid.m <chr> "yes", "no", "yes", "yes", "yes", "yes", "no", "yes", "no",…
## $ absences.m <int> 2, 2, 8, 2, 8, 2, 0, 2, 12, 10, 0, 0, 2, 0, 2, 2, 4, 8, 0, …
## $ G1.m <int> 7, 8, 14, 10, 10, 12, 12, 8, 16, 10, 14, 8, 9, 8, 10, 16, 1…
## $ G2.m <int> 10, 6, 13, 9, 10, 12, 0, 9, 16, 11, 15, 0, 10, 11, 11, 15, …
## $ G3.m <int> 10, 5, 13, 8, 10, 11, 0, 8, 16, 11, 15, 0, 10, 11, 11, 15, …
## $ alc_use <dbl> 1.0, 3.0, 1.0, 1.0, 2.5, 1.0, 2.0, 2.0, 2.5, 1.0, 1.0, 1.5,…
## $ high_use <lgl> FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE,…
## $ cid <int> 3001, 3002, 3003, 3004, 3005, 3006, 3007, 3008, 3009, 3010,…
gather(joined_data) %>% glimpse
## Rows: 18,870
## Columns: 2
## $ key <chr> "school", "school", "school", "school", "school", "school", "sch…
## $ value <chr> "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP"…
#Produce summary statistics of high_use and health
joined_data %>% group_by(high_use) %>% summarise(count = n(),mean(health))
## # A tibble: 2 x 3
## high_use count `mean(health)`
## <lgl> <int> <dbl>
## 1 FALSE 259 3.49
## 2 TRUE 111 3.73
# initialize a plot of high_use and health
a1 <- ggplot(joined_data, aes(x = high_use, y = health))
# define the plot as a boxplot and draw it
a1 + geom_boxplot()+ xlab("high_use") + ylab("health")
Second, let’s take a look at high alcohol use and absences. The connection between absences and high alcohol use is as I expected: the mean number of absences for students with high alcohol use is 6.38 days, versus 3.71 days for those with lower alcohol use.
#Produce summary statistics of high_use and absences
joined_data %>% group_by(high_use) %>% summarise(count = n(),mean(absences))
## # A tibble: 2 x 3
## high_use count `mean(absences)`
## <lgl> <int> <dbl>
## 1 FALSE 259 3.71
## 2 TRUE 111 6.38
# initialize a plot of high_use and absences
a2 <- ggplot(joined_data, aes(x = high_use, y = absences))
# define the plot as a boxplot and draw it
a2 + geom_boxplot() + xlab("high_use") + ylab("absences")
#Produce summary statistics of family support and high alcohol use
joined_data %>% group_by(famsup) %>% summarise(count = n(),mean(high_use))
## # A tibble: 2 x 3
## famsup count `mean(high_use)`
## <chr> <int> <dbl>
## 1 no 139 0.324
## 2 yes 231 0.286
# initialize a plot of famsup and high_use
a3 <- ggplot(joined_data, aes(x = famsup, y = high_use))
# define the plot as a boxplot and draw it
a3 + geom_boxplot() + xlab("family support") + ylab("high_use")
#Produce summary statistics of extra curricular activities and high alcohol use
joined_data %>% group_by(activities) %>% summarise(count = n(),mean(high_use))
## # A tibble: 2 x 3
## activities count `mean(high_use)`
## <chr> <int> <dbl>
## 1 no 179 0.330
## 2 yes 191 0.272
# initialize a plot of activities and high_use
a4 <- ggplot(joined_data, aes(x = activities, y = high_use))
# define the plot as a boxplot and draw it
a4 + geom_boxplot() + xlab("activities")+ ylab("high_use")
5. Next, I am using logistic regression to statistically explore the relationship between my chosen variables and the binary high/low alcohol consumption variable as the target variable. I will present and interpret the coefficients of the model as odds ratios and provide confidence intervals for them.
#regression
m <- glm(high_use ~ health + absences + famsup + activities, data = joined_data , family = "binomial")
summary(m)
##
## Call:
## glm(formula = high_use ~ health + absences + famsup + activities,
## family = "binomial", data = joined_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5664 -0.8251 -0.7171 1.2578 1.9729
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.48808 0.39040 -3.812 0.000138 ***
## health 0.14099 0.08613 1.637 0.101637
## absences 0.08975 0.02315 3.877 0.000106 ***
## famsupyes -0.23892 0.24092 -0.992 0.321337
## activitiesyes -0.29591 0.23480 -1.260 0.207580
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 429.59 on 365 degrees of freedom
## AIC: 439.59
##
## Number of Fisher Scoring iterations: 4
coef(m)
## (Intercept) health absences famsupyes activitiesyes
## -1.48807647 0.14098501 0.08974717 -0.23892472 -0.29590959
OR <- coef(m) %>% exp
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.2258066 0.1027826 0.4768908
## health 1.1514074 0.9747528 1.3673371
## absences 1.0938977 1.0475837 1.1472586
## famsupyes 0.7874742 0.4913683 1.2657321
## activitiesyes 0.7438547 0.4683710 1.1776531
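As a quick sanity check on the odds-ratio reading, a single coefficient can be exponentiated by hand; this sketch uses the absences estimate from the summary above:

```r
# The logistic coefficient for absences (log-odds scale) was 0.08975
b_absences <- 0.08975
exp(b_absences)       # ≈ 1.094: each extra absence multiplies the odds of high use by ~9%
exp(10 * b_absences)  # ≈ 2.45: ten extra absences more than double the odds
```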
As we can see from the regression coefficients, family support and extra-curricular activities are negatively correlated with high alcohol use. On the other hand, good current health status and absences are positively correlated with high alcohol use. The only statistically significant result is the connection between high alcohol use and absences.
The knowledge I have on the data is not enough to make clear causal interpretations, although it might be tempting to claim that family support and extra curricular activities might lead to less alcohol consumption.
It is also not very surprising that absences and high alcohol consumption are positively correlated, but it is more difficult to say which one is the cause and which one is the effect (or whether there is some third variable causing both of these).
To me, the most surprising result is that high alcohol consumption and good current health status are positively correlated. It might be that the possible health hazards of high alcohol use do not show yet, because the respondents are so young. I would still have thought that high alcohol use would be connected with worse mental health in young people as well. So I guess it all comes down to how health status is measured. It might be that the youngsters with bad health do not want to risk it further by using alcohol.
6. Using the variables which, according to my regression model, had a statistical relationship with high/low alcohol consumption, I will explore the predictive power of my model. I will provide a 2x2 cross-tabulation of predictions versus the actual values and optionally display a graphic visualizing both the actual values and the predictions.
# fit the model
m <- glm(high_use ~ health + absences + famsup + activities , data = joined_data, family = "binomial")
# predict() the probability of high_use
probabilities <- predict(m, type = "response")
# add the predicted probabilities to 'joined_data'
joined_data <- mutate(joined_data, probability = probabilities)
# use the probabilities to make a prediction of high_use
joined_data <- mutate(joined_data, prediction = probability > 0.5)
# see the last ten original classes, predicted probabilities, and class predictions
select(joined_data, sex, age, goout, address, high_use, probability, prediction) %>% tail(10)
## sex age goout address high_use probability prediction
## 361 M 18 3 R TRUE 0.3924820 FALSE
## 362 M 18 3 R TRUE 0.2090412 FALSE
## 363 M 18 1 R TRUE 0.3535075 FALSE
## 364 M 18 4 U TRUE 0.2417650 FALSE
## 365 M 18 2 U FALSE 0.3109075 FALSE
## 366 M 18 3 U TRUE 0.3304539 FALSE
## 367 M 18 2 U FALSE 0.3136411 FALSE
## 368 M 19 4 R TRUE 0.3400464 FALSE
## 369 M 19 4 R TRUE 0.3803344 FALSE
## 370 M 19 2 R FALSE 0.3136411 FALSE
# tabulate the target variable versus the predictions
table(high_use = joined_data$high_use, prediction = joined_data$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 251 8
## TRUE 98 13
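To quantify the table above, a small sketch computing overall accuracy and the training error; the counts are hard-coded from the cross-tabulation:

```r
# Confusion matrix from above: rows = actual high_use, columns = prediction
conf <- matrix(c(251, 98, 8, 13), nrow = 2,
               dimnames = list(high_use = c("FALSE", "TRUE"),
                               prediction = c("FALSE", "TRUE")))
accuracy <- sum(diag(conf)) / sum(conf)  # (251 + 13) / 370 ≈ 0.714
1 - accuracy                             # training error ≈ 0.286
```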
1. First, I created the Rmd file, Chapter4 and added it as a child file in the index.Rmd.
2. I will load the Boston dataset from the MASS package and describe it briefly.
# access the MASS package
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
# load the data
data("Boston")
# explore the dataset
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506 14
The Boston dataset contains housing values in suburbs of Boston. It includes 14 variables and 506 observations. The 14 variables in the dataset are:
crim, per capita crime rate by town.
zn, proportion of residential land zoned for lots over 25,000 sq.ft.
indus, proportion of non-retail business acres per town.
chas, Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
nox, nitrogen oxides concentration (parts per 10 million).
rm, average number of rooms per dwelling.
age, proportion of owner-occupied units built prior to 1940.
dis, weighted mean of distances to five Boston employment centres.
rad, index of accessibility to radial highways.
tax, full-value property-tax rate per $10,000.
ptratio, pupil-teacher ratio by town.
black, 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
lstat, lower status of the population (percent).
medv, median value of owner-occupied homes in $1000s.
3. I will show a graphical overview of the data and show summaries of the variables in the data.
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
#summaries are interpreted along with appropriate graphics of all the continuous variables
boxplot(Boston$crim)
boxplot(Boston$zn)
boxplot(Boston$indus)
boxplot(Boston$nox)
boxplot(Boston$rm)
boxplot(Boston$age)
boxplot(Boston$dis)
boxplot(Boston$tax)
boxplot(Boston$ptratio)
boxplot(Boston$black)
boxplot(Boston$lstat)
boxplot(Boston$medv)
From the summaries we get an overview of the variables. Next, let’s look at the correlations between them:
library(corrplot)
## corrplot 0.92 loaded
library(tidyr)
# calculate the correlation matrix and round it
cor_matrix<-cor(Boston) %>% round(digits = 2)
# print the correlation matrix
cor_matrix
## crim zn indus chas nox rm age dis rad tax ptratio
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58 0.29
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31 -0.39
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72 0.38
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04 -0.12
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67 0.19
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29 -0.36
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51 0.26
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53 -0.23
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91 0.46
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00 0.46
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46 1.00
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44 -0.18
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54 0.37
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47 -0.51
## black lstat medv
## crim -0.39 0.46 -0.39
## zn 0.18 -0.41 0.36
## indus -0.36 0.60 -0.48
## chas 0.05 -0.05 0.18
## nox -0.38 0.59 -0.43
## rm 0.13 -0.61 0.70
## age -0.27 0.60 -0.38
## dis 0.29 -0.50 0.25
## rad -0.44 0.49 -0.38
## tax -0.44 0.54 -0.47
## ptratio -0.18 0.37 -0.51
## black 1.00 -0.37 0.33
## lstat -0.37 1.00 -0.74
## medv 0.33 -0.74 1.00
# visualize the correlation matrix
corrplot(cor_matrix, method="circle", type="upper", cl.pos="b", tl.pos="d", tl.cex = 0.6)
From the correlation matrix we can see that there is a strong positive correlation (0.91) between the index of accessibility to radial highways (rad) and the full-value property-tax rate per $10,000 (tax).
There are also strong negative correlations between the weighted mean of distances to five Boston employment centres (dis) and nitrogen oxides concentration (nox), proportion of owner-occupied units built prior to 1940 (age), and proportion of non-retail business acres per town (indus).
I am interested in education, so I want to take a closer look at how the pupil-teacher ratio by town is correlated with the other variables. It seems that there is a somewhat strong positive correlation with:
tax, full-value property-tax rate per $10,000.(0.46)
rad, index of accessibility to radial highways.(0.46)
indus, proportion of non-retail business acres per town (0.38)
lstat, lower status of the population (percent) (0.37)
and a negative correlation with:
medv, median value of owner-occupied homes in $1000s. (-0.51)
rm, average number of rooms per dwelling (-0.36)
zn, proportion of residential land zoned for lots over 25,000 sq.ft.(-0.39)
An interesting observation is that a high pupil-teacher ratio is connected to a higher proportion of lower-status population and a lower median value of owner-occupied homes.
4. Next, I will standardize the dataset and print out summaries of the scaled data. As we can see, standardization sets the mean of every variable to 0 (and scales each variable to unit standard deviation).
# center and standardize variables
boston_scaled <- scale(Boston)
# summaries of the scaled variables
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
# class of the boston_scaled object
class(boston_scaled)
## [1] "matrix" "array"
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)
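As a reminder of what scale() actually does (subtract the column mean, then divide by the column standard deviation), a self-contained toy check:

```r
# Standardize a small vector by hand and compare with scale()
x <- c(1, 2, 3, 10)
manual <- (x - mean(x)) / sd(x)
isTRUE(all.equal(as.numeric(scale(x)), manual))  # TRUE
```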
Next, let’s create a categorical variable of the crime rate in the Boston dataset using the scaled crime rate and quantiles as the break points. Let’s also drop the old crime rate variable from the dataset and divide the dataset to train and test sets, so that 80% of the data belongs to the train set.
# summary of the scaled crime rate
summary(boston_scaled$crim)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.419367 -0.410563 -0.390280 0.000000 0.007389 9.924110
# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
# look at the table of the new factor crime
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
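The quantile-based cut() above can be illustrated with a toy vector, where the four classes come out exactly equal in size:

```r
# Quartile break points split 1..100 into four classes of 25 each
x <- 1:100
b <- quantile(x)  # 0%, 25%, 50%, 75%, 100% break points
f <- cut(x, breaks = b, include.lowest = TRUE,
         labels = c("low", "med_low", "med_high", "high"))
table(f)          # 25 observations in each class
```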
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
#Dividing the dataset to train and test sets, so that 80% of the data belongs to the train set
# number of rows in the Boston dataset
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set
test <- boston_scaled[-ind,]
# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
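Because sample() is random, the train/test split (and hence the LDA results below) changes on every knit; a sketch of fixing the random seed first, with a toy n and an arbitrary seed value chosen only for illustration:

```r
set.seed(2021)                   # arbitrary seed, for reproducibility only
n <- 20                          # toy number of rows
ind <- sample(n, size = n * 0.8) # randomly choose 80% of the rows
length(ind)                      # 16 rows go to the training set
```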
5. Next, I will fit linear discriminant analysis on the train set, using the categorical crime rate as the target variable and all the other variables in the dataset as predictors. I will draw the LDA (bi)plot.
# MASS and train are available
library(MASS)
# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2450495 0.2673267 0.2524752 0.2351485
##
## Group means:
## zn indus chas nox rm age
## low 0.83913588 -0.9159953 -0.07348562 -0.8579241 0.44093723 -0.8547685
## med_low -0.07395223 -0.3098346 -0.01714665 -0.5317042 -0.14884184 -0.2916667
## med_high -0.38551208 0.2052409 0.26805724 0.4397051 0.08223166 0.4833755
## high -0.48724019 1.0172655 -0.10655643 1.0547392 -0.38320288 0.8367112
## dis rad tax ptratio black lstat
## low 0.8643576 -0.6848939 -0.7275608 -0.40870614 0.38055742 -0.75840133
## med_low 0.3041043 -0.5416255 -0.4494346 -0.06970336 0.32117082 -0.10835980
## med_high -0.3918093 -0.4143934 -0.3115151 -0.33675813 0.08141243 0.03577426
## high -0.8454898 1.6366336 1.5129868 0.77903654 -0.77486148 0.86724390
## medv
## low 0.511966351
## med_low -0.006184599
## med_high 0.143751297
## high -0.716835023
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.10344323 0.54036155 -0.917174066
## indus 0.02601035 -0.30941056 0.247437368
## chas -0.08451638 -0.08768227 0.008555455
## nox 0.31829819 -0.80262644 -1.330811166
## rm -0.11158690 -0.08498676 -0.222976384
## age 0.30633999 -0.45551162 -0.077716451
## dis -0.06895773 -0.24742481 0.001623111
## rad 3.18911016 0.73956755 -0.273212527
## tax -0.03476271 0.28903932 0.647981848
## ptratio 0.12498423 -0.01237725 -0.258661336
## black -0.13347209 0.03439243 0.134180843
## lstat 0.21748282 -0.13299327 0.444920753
## medv 0.20817474 -0.34225261 -0.100315815
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9460 0.0418 0.0122
# the function for lda biplot arrows
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "orange", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0,
         x1 = myscale * heads[, choices[1]],
         y1 = myscale * heads[, choices[2]], col = color, length = arrow_heads)
  text(myscale * heads[, choices], labels = row.names(heads),
       cex = tex, col = color, pos = 3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)
6. The correct crime categories were saved from the test set and the categorical crime variable removed from it in the previous step. Next, I will predict the classes with the LDA model on the test data and cross-tabulate the results against the correct crime categories.
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 19 8 1 0
## med_low 4 13 1 0
## med_high 0 10 13 1
## high 0 0 0 32
The LDA model, fitted on 80% of the original dataset, predicts the categories of the crime variable quite well. The high category is predicted best: all 32 high observations in the test set are classified correctly.
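The confusion matrix above can be summarized with an overall accuracy. A sketch using the values printed above (the exact numbers vary between runs, because the train/test split is random):

```r
# confusion matrix copied from the cross-tabulation above (rows = correct, cols = predicted)
conf <- matrix(c(19,  8,  1,  0,
                  4, 13,  1,  0,
                  0, 10, 13,  1,
                  0,  0,  0, 32),
               nrow = 4, byrow = TRUE)
sum(diag(conf)) / sum(conf)  # overall accuracy: 77/102, about 0.75
```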
7. Next I will reload the Boston dataset and standardize it. Then I will calculate the distances between the observations and run the k-means algorithm on the dataset.
I will also investigate the optimal number of clusters and run the algorithm again. Finally, I will visualize the clusters with the pairs() and ggpairs() functions, where the clusters are separated by color, and interpret the results.
# load the data again
data("Boston")
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
dim(Boston)
## [1] 506 14
#standardizing the data
boston_scaled2 <- scale(Boston)
boston_scaled2 <- as.data.frame(boston_scaled2)
summary(boston_scaled2)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
# euclidean distance matrix
dist_eu <- dist(boston_scaled2)
# look at the summary of the distances
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
# manhattan distance matrix
dist_man <- dist(boston_scaled2, method = 'manhattan')
# look at the summary of the distances
summary(dist_man)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2662 8.4832 12.6090 13.5488 17.7568 48.8618
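For intuition on the two metrics: Euclidean distance is the straight-line distance, while Manhattan distance sums the absolute coordinate differences, so it is always at least as large. This matches the summaries above, where the Manhattan distances are systematically larger. A minimal illustration:

```r
# distance between the points (0, 0) and (3, 4)
x <- c(0, 0)
y <- c(3, 4)
sqrt(sum((x - y)^2))  # euclidean: 5
sum(abs(x - y))       # manhattan: 7
```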
# k-means clustering
km <- kmeans(boston_scaled2, centers = 3)
# plot the Boston dataset with clusters
pairs(Boston, col = km$cluster)
# Boston dataset is available
set.seed(123)
# determine the number of clusters
k_max <- 10
# calculate the total within sum of squares (on the standardized data, to match the clustering above)
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled2, k)$tot.withinss})
# visualize the results
library(ggplot2)
qplot(x = 1:k_max, y = twcss, geom = 'line')
# k-means clustering with the chosen number of clusters (standardized data)
km <- kmeans(boston_scaled2, centers = 2)
# plot the Boston dataset with clusters
pairs(Boston, col = km$cluster)
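The ggpairs() alternative mentioned above can be sketched as follows, coloring a readable subset of columns by cluster membership (this assumes the km object from the previous chunk and that the GGally package is installed):

```r
library(GGally)
library(ggplot2)
# a few columns only, to keep the plot matrix readable
df <- data.frame(boston_scaled2[, c("rad", "tax", "nox", "medv")],
                 cluster = factor(km$cluster))
ggpairs(df, columns = 1:4, mapping = aes(colour = cluster))
```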
1. First, I will show a graphical overview of the data and summaries of the variables in the data.
#human <- read.csv("data/human.csv", header=TRUE)
#summary(human)
human_ <- read.csv("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/human2.txt", sep=",", header=TRUE)
summary(human_)
## Edu2.FM Labo.FM Edu.Exp Life.Exp
## Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30
## Median :0.9375 Median :0.7535 Median :13.50 Median :74.20
## Mean :0.8529 Mean :0.7074 Mean :13.18 Mean :71.65
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25
## Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50
## GNI Mat.Mor Ado.Birth Parli.F
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(dplyr)
library(corrplot)
# visualize the 'human_' variables
ggpairs(human_)
# compute the correlation matrix and visualize it with corrplot
cor(human_) %>% corrplot
There are large differences in the variables between countries. Some observations:
Expected education correlates positively with life expectancy, and adolescent birth rate correlates positively with maternal mortality.
Maternal mortality correlates negatively with both life expectancy and expected education; adolescent birth rate likewise correlates negatively with both.
We can also note that higher female secondary education is positively correlated with expected education, life expectancy and GNI, and negatively correlated with adolescent birth rate and maternal mortality.
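To back up these observations numerically, the strongest pairwise correlations can be listed directly from the correlation matrix (a sketch, assuming human_ as loaded above):

```r
cm <- cor(human_)
cm[!lower.tri(cm)] <- NA   # keep each pair once, drop the diagonal
ord <- order(abs(cm), decreasing = TRUE, na.last = NA)
head(data.frame(var1 = rownames(cm)[row(cm)[ord]],
                var2 = colnames(cm)[col(cm)[ord]],
                cor  = round(cm[ord], 2)))
```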
2. Next, I will perform principal component analysis (PCA) on the not standardized human data. I will show the variability captured by the principal components and draw a biplot displaying the observations by the first two principal components (PC1 coordinate in x-axis, PC2 coordinate in y-axis), along with arrows representing the original variables.
# print out summaries of the human_ variables
summary(human_)
## Edu2.FM Labo.FM Edu.Exp Life.Exp
## Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30
## Median :0.9375 Median :0.7535 Median :13.50 Median :74.20
## Mean :0.8529 Mean :0.7074 Mean :13.18 Mean :71.65
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25
## Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50
## GNI Mat.Mor Ado.Birth Parli.F
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human_)
# draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2)
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
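The warnings (and the lopsided biplot) stem from the unstandardized variables: GNI's variance is orders of magnitude larger than the others, so it dominates the first principal component. A quick check of the raw variances (assumes human_ as loaded above):

```r
# variances of the unstandardized variables; GNI dwarfs everything else
round(sapply(human_, var), 1)
```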
3. Next, I will standardize the variables in the human_ data and repeat the above analysis.
Standardization sets each variable's mean to zero and its standard deviation to one, so variables with large raw variances (such as GNI) no longer dominate the principal components; this produces a very different biplot.
# standardize the variables
human_std <- scale(human_)
# print out summaries of the standardized variables
summary(human_std)
## Edu2.FM Labo.FM Edu.Exp Life.Exp
## Min. :-2.8189 Min. :-2.6247 Min. :-2.7378 Min. :-2.7188
## 1st Qu.:-0.5233 1st Qu.:-0.5484 1st Qu.:-0.6782 1st Qu.:-0.6425
## Median : 0.3503 Median : 0.2316 Median : 0.1140 Median : 0.3056
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5958 3rd Qu.: 0.7350 3rd Qu.: 0.7126 3rd Qu.: 0.6717
## Max. : 2.6646 Max. : 1.6632 Max. : 2.4730 Max. : 1.4218
## GNI Mat.Mor Ado.Birth Parli.F
## Min. :-0.9193 Min. :-0.6992 Min. :-1.1325 Min. :-1.8203
## 1st Qu.:-0.7243 1st Qu.:-0.6496 1st Qu.:-0.8394 1st Qu.:-0.7409
## Median :-0.3013 Median :-0.4726 Median :-0.3298 Median :-0.1403
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.3712 3rd Qu.: 0.1932 3rd Qu.: 0.6030 3rd Qu.: 0.6127
## Max. : 5.6890 Max. : 4.4899 Max. : 3.8344 Max. : 3.1850
# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human_std)
# draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2)
4. Based on the biplot drawn after PCA on the standardized human data, we can note the following:
Female labour-force participation and female parliamentary representation point in the direction of the second principal component, so they contribute mainly to that dimension.
All the other variables point in the direction of the first principal component and thus contribute mainly to that dimension.
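The variability captured by each component can also be printed explicitly (a sketch, assuming pca_human is the PCA of the standardized data from the previous chunk):

```r
# percentage of variance explained by each principal component
s <- summary(pca_human)
pca_pr <- round(100 * s$importance[2, ], digits = 1)
pca_pr
```

These percentages could also be pasted into the biplot axis labels via the xlab and ylab arguments of biplot().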
5. I will load the tea dataset from the package FactoMineR and explore the data briefly.
# access the Factominer, ggplot2, dplyr and tidyr packages
library(FactoMineR)
library(ggplot2)
library(dplyr)
library(tidyr)
# load the tea data and exploring the data
data("tea")
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
## $ frequency : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300 36
summary(tea)
## breakfast tea.time evening lunch
## breakfast :144 Not.tea time:131 evening :103 lunch : 44
## Not.breakfast:156 tea time :169 Not.evening:197 Not.lunch:256
##
##
##
##
##
## dinner always home work
## dinner : 21 always :103 home :291 Not.work:213
## Not.dinner:279 Not.always:197 Not.home: 9 work : 87
##
##
##
##
##
## tearoom friends resto pub
## Not.tearoom:242 friends :196 Not.resto:221 Not.pub:237
## tearoom : 58 Not.friends:104 resto : 79 pub : 63
##
##
##
##
##
## Tea How sugar how
## black : 74 alone:195 No.sugar:155 tea bag :170
## Earl Grey:193 lemon: 33 sugar :145 tea bag+unpackaged: 94
## green : 33 milk : 63 unpackaged : 36
## other: 9
##
##
##
## where price age sex
## chain store :192 p_branded : 95 Min. :15.00 F:178
## chain store+tea shop: 78 p_cheap : 7 1st Qu.:23.00 M:122
## tea shop : 30 p_private label: 21 Median :32.00
## p_unknown : 12 Mean :37.05
## p_upscale : 53 3rd Qu.:48.00
## p_variable :112 Max. :90.00
##
## SPC Sport age_Q frequency
## employee :59 Not.sportsman:121 15-24:92 1/day : 95
## middle :40 sportsman :179 25-34:69 1 to 2/week: 44
## non-worker :64 35-44:40 +2/day :127
## other worker:20 45-59:61 3 to 6/week: 34
## senior :35 +60 :38
## student :70
## workman :12
## escape.exoticism spirituality healthy
## escape-exoticism :142 Not.spirituality:206 healthy :210
## Not.escape-exoticism:158 spirituality : 94 Not.healthy: 90
##
##
##
##
##
## diuretic friendliness iron.absorption
## diuretic :174 friendliness :242 iron absorption : 31
## Not.diuretic:126 Not.friendliness: 58 Not.iron absorption:269
##
##
##
##
##
## feminine sophisticated slimming exciting
## feminine :129 Not.sophisticated: 85 No.slimming:255 exciting :116
## Not.feminine:171 sophisticated :215 slimming : 45 No.exciting:184
##
##
##
##
##
## relaxing effect.on.health
## No.relaxing:113 effect on health : 66
## relaxing :187 No.effect on health:234
##
##
##
##
##
#The tea dataset contains 36 variables and 300 observations
# visualize the dataset
boxplot(tea[, 1:18], main = 'Multiple Box plots')
boxplot(tea$age)
boxplot(tea[, 20:36], main = 'Multiple Box plots')
Next, I will create a new dataset “tea_time”, run Multiple Correspondence Analysis (MCA) on it and comment on the outputs.
# column names to keep in the dataset
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
# select the 'keep_columns' to create a new dataset
# dplyr::select is used explicitly because MASS (loaded earlier) masks select()
tea_time <- dplyr::select(tea, one_of(keep_columns))
# look at the summaries and structure of the data
summary(tea_time)
str(tea_time)
# visualize the dataset
gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
# multiple correspondence analysis
mca <- MCA(tea_time, graph = FALSE)
# summary of the model
summary(mca)
# visualize MCA: first the variables only, then the individuals only
plot(mca, invisible = c("ind"))
plot(mca, invisible = c("var"))
# the argument habillage = "quali" colors the categories by variable,
# which makes the visualization easier to interpret
plot(mca, invisible = c("ind"), habillage = "quali")
# note: MCA() also draws its plots by default when graph is not set to FALSE
(more chapters to be added similarly as we proceed with the course!)